Why doesn't .interpolate_na work well?
-
Hello,
I'd like to fill the NaNs in my data using .interpolate_na(), the same way pandas' df.interpolate() works.
Best regards,
import xarray as xr
import qnt.ta as qnta
import qnt.data as qndata
import qnt.output as qnout
import qnt.stats as qns

data = qndata.stocks.load_ndx_data(min_date="2005-06-01")
close = data.sel(field='close')

it = close.sel(asset=[
    'NAS:AAPL', 'NAS:ADBE', 'NAS:ADI', 'NAS:ADSK', 'NAS:AKAM',
    'NAS:AMAT', 'NAS:AMD', 'NAS:AMZN', 'NAS:ANSS', 'NAS:ASML',
    'NAS:ATVI', 'NAS:BIDU', 'NAS:CDNS', 'NAS:CDW', 'NAS:CERN',
    'NAS:CHKP', 'NAS:CMCSA', 'NAS:CSCO', 'NAS:CTSH', 'NAS:CTXS',
    'NAS:DISCA', 'NAS:DISCK', 'NAS:DISH', 'NAS:EA', 'NAS:EBAY',
    'NAS:ERIC', 'NAS:EXPE', 'NAS:FB', 'NAS:FFIV', 'NAS:FISV',
    'NAS:FLEX', 'NAS:FLIR', 'NAS:FTNT', 'NAS:INTC', 'NAS:INTU',
    'NAS:JD', 'NAS:KLAC', 'NAS:LBTYA', 'NAS:LBTYK', 'NAS:LOGI',
    'NAS:LRCX', 'NAS:MCHP', 'NAS:MELI', 'NAS:META', 'NAS:MSFT',
    'NAS:MU', 'NAS:MXIM', 'NAS:NFLX', 'NAS:NTAP', 'NAS:NTES',
    'NAS:NUAN', 'NAS:NVDA', 'NAS:NXPI', 'NAS:PANW', 'NAS:PAYX',
    'NAS:QCOM', 'NAS:SIRI', 'NAS:SNPS', 'NAS:SPLK', 'NAS:SWKS',
    'NAS:TIGO', 'NAS:TMUS', 'NAS:TRIP', 'NAS:TTWO', 'NAS:TXN',
    'NAS:VOD', 'NAS:VRSN', 'NAS:WDAY', 'NAS:WDC', 'NAS:XLNX',
    'NYS:BB', 'NYS:INFY', 'NYS:JNPR', 'NYS:ORCL',
])
it = it.interpolate_na(dim='time', method='linear')
it
-
Ok, can you give us more details? What exactly is the problem? Best regards
-
After I apply .dropna() to 'it', some of the data starting from '2005-06-01' is still deleted, even though I interpolated the NaNs first.
What do you think causes this?
-
@cyan-gloom
interpolate_na() only eliminates NaNs between 2 valid data points. Take a look at this example:

import qnt.data as qndata
import numpy as np

stocks = qndata.stocks_load_ndx_data()
sample = stocks[:, -5:, -6:]  # The latest 5 dates for the last 6 assets
print(sample.sel(field='close').to_pandas())
"""
asset       NYS:NCLH  NYS:ORCL  NYS:PRGO  NYS:QGEN  NYS:RHT  NYS:TEVA
time
2023-05-12     13.24     97.85     35.21     45.09      NaN      8.03
2023-05-15     13.71     97.26     34.23     45.36      NaN      8.07
2023-05-16     13.48     98.25     32.84     45.25      NaN      8.13
2023-05-17     14.35     99.77     32.86     44.95      NaN      8.13
2023-05-18     14.53    102.34     33.43     44.92      NaN      8.26
"""

# Let's add some more NaN values:
sample.values[3, (1, 3), 0] = np.nan
sample.values[3, 1:4, 1] = np.nan
sample.values[3, :2, 2] = np.nan
sample.values[3, 2:, 3] = np.nan
sample.values[3, :-1, 5] = np.nan
print(sample.sel(field='close').to_pandas())
"""
asset       NYS:NCLH  NYS:ORCL  NYS:PRGO  NYS:QGEN  NYS:RHT  NYS:TEVA
time
2023-05-12     13.24     97.85       NaN     45.09      NaN       NaN
2023-05-15       NaN       NaN       NaN     45.36      NaN       NaN
2023-05-16     13.48       NaN     32.84       NaN      NaN       NaN
2023-05-17       NaN       NaN     32.86       NaN      NaN       NaN
2023-05-18     14.53    102.34     33.43       NaN      NaN      8.26
"""

# Interpolate the NaN values:
print(sample.interpolate_na('time').sel(field='close').to_pandas())
"""
asset       NYS:NCLH    NYS:ORCL  NYS:PRGO  NYS:QGEN  NYS:RHT  NYS:TEVA
time
2023-05-12    13.240   97.850000       NaN     45.09      NaN       NaN
2023-05-15    13.420  100.095000       NaN     45.36      NaN       NaN
2023-05-16    13.480  100.843333     32.84       NaN      NaN       NaN
2023-05-17    14.005  101.591667     32.86       NaN      NaN       NaN
2023-05-18    14.530  102.340000     33.43       NaN      NaN      8.26
"""
As you can see, only the NaNs in the first 2 columns are being replaced. The others remain untouched and might be dropped when you use dropna().
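To make the boundary behavior concrete, here is a minimal, self-contained sketch (with made-up toy data, not the qnt dataset) showing that interpolate_na() leaves leading and trailing NaNs alone, and how chaining ffill()/bfill() can fill them. Note that ffill()/bfill() may require the bottleneck package in older xarray versions.

```python
import numpy as np
import xarray as xr

# Toy series with NaNs in the middle AND at both ends (assumed data):
da = xr.DataArray(
    [np.nan, 1.0, np.nan, 3.0, np.nan],
    dims="time",
    coords={"time": range(5)},
)

# interpolate_na only fills the gap BETWEEN the two valid points:
interp = da.interpolate_na(dim="time", method="linear")
print(interp.values)  # [nan  1.  2.  3. nan]

# ffill/bfill handle the edges (may need the `bottleneck` package):
filled = interp.ffill(dim="time").bfill(dim="time")
print(filled.values)  # [1. 1. 2. 3. 3.]
```

After this combination no NaNs remain, so a later dropna() would not discard any dates — but see the lookahead-bias caveat below before filling blindly.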
Another thing you should keep in mind is that you might introduce lookahead bias with interpolation, e.g. in a single-run backtest. In my example, for instance (pretend the NaNs I added were already in the data), you would know on 2023-05-15 that ORCL will rise, when in reality you would only learn that on 2023-05-18.
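The lookahead effect can be shown directly. This is a sketch with assumed toy prices (mirroring the ORCL gap above, not the real dataset): linear interpolation fills a gap using the future endpoint, while a forward-fill only uses past observations and is therefore causal.

```python
import numpy as np
import xarray as xr

# Assumed toy price series: quoted at 97.85, missing for three days,
# then reappears at 102.34.
prices = xr.DataArray(
    [97.85, np.nan, np.nan, np.nan, 102.34],
    dims="time",
    coords={"time": range(5)},
)

# Linear interpolation fills the gap on a straight line toward the
# FUTURE value 102.34 -- the filled points "know" the price will rise:
look_ahead = prices.interpolate_na(dim="time", method="linear")
print(look_ahead.values)  # rises smoothly from 97.85 to 102.34

# A causal alternative: forward-fill repeats the last KNOWN price
# (may need the `bottleneck` package in older xarray versions):
causal = prices.ffill(dim="time")
print(causal.values)  # gap days all equal 97.85
```

Which fill is appropriate depends on the strategy, but for backtests the forward-fill avoids leaking future information into past dates.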
-
I got it!
Thanks a lot!!